Load the necessary libraries
library(car) #for regression diagnostics
library(broom) #for tidy output
library(ggfortify) #for model diagnostics
library(DHARMa) #for residual diagnostics
library(performance) #for residuals diagnostics
library(see) #for plotting residuals
library(sjPlot) #for outputs
library(knitr) #for kable
library(effects) #for partial effects plots
library(ggeffects) #for partial effects plots
library(emmeans) #for estimating marginal means
library(modelr) #for auxiliary modelling functions
library(tidyverse) #for data wrangling
Here is an example from Fowler, Cohen, and Jarvis (1998). An agriculturalist was interested in the effects of fertilizer load on the yield of grass. Grass seed was sown uniformly over an area and different quantities of commercial fertilizer were applied to each of ten 1 m2 randomly located plots. Two months later the grass from each plot was harvested, dried and weighed. The data are in the file fertilizer.csv in the data folder.
| FERTILIZER | YIELD |
|---|---|
| 25 | 84 |
| 50 | 80 |
| 75 | 90 |
| 100 | 154 |
| 125 | 148 |
| ... | ... |
| FERTILIZER: | Mass of fertilizer (g.m-2) - Predictor variable |
| YIELD: | Yield of grass (g.m-2) - Response variable |
The aim of the analysis is to investigate the relationship between fertilizer concentration and grass yield.
We will start off by reading in the Fertilizer data. There are many functions in R that can read in a CSV file. We will use the read_csv() function as it is part of the tidyverse ecosystem.
fert = read_csv('../public/data/fertilizer.csv', trim_ws=TRUE)
##
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## FERTILIZER = col_double(),
## YIELD = col_double()
## )
After reading in a dataset, it is always a good idea to quickly explore a few summaries in order to ascertain whether the imported data are correctly transcribed. In particular, we should pay attention to whether there are any unexpected missing values and ensure that each variable (column) has the expected class (e.g. that variables we expected to be considered numbers are indeed listed as either <dbl> or <int> and not <char>).
glimpse(fert)
## Rows: 10
## Columns: 2
## $ FERTILIZER <dbl> 25, 50, 75, 100, 125, 150, 175, 200, 225, 250
## $ YIELD <dbl> 84, 80, 90, 154, 148, 169, 206, 244, 212, 248
## Explore the first 6 rows of the data
head(fert)
str(fert)
## spec_tbl_df [10 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ FERTILIZER: num [1:10] 25 50 75 100 125 150 175 200 225 250
## $ YIELD : num [1:10] 84 80 90 154 148 169 206 244 212 248
## - attr(*, "spec")=
## .. cols(
## .. FERTILIZER = col_double(),
## .. YIELD = col_double()
## .. )
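As an additional sketch of the checks described above, we can also tally the number of missing values in each column (for this dataset, all zeros are expected):

```r
## Count the missing values in each column - zeros indicate a clean import
colSums(is.na(fert))
## FERTILIZER      YIELD 
##          0          0
```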
The individual responses (\(y_i\), observed yields) are each expected to have been independently drawn from normal (Gaussian) distributions (\(\mathcal{N}\)). These distributions represent all the possible yields we could have obtained at the specific (\(i^{th}\)) level of Fertilizer. Hence the \(i^{th}\) yield observation is expected to have been drawn from a normal distribution with a mean of \(\mu_i\).
Although each distribution is expected to come from populations that differ in their means, we assume that all of these distributions have the same variance (\(\sigma^2\)).
Model formula: \[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_1 x_i \]
where \(y_i\) represents the \(i^{th}\) observed value, \(\beta_0\) and \(\beta_1\) represent the intercept and slope respectively, and \(\sigma^2\) represents the estimated variance.
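To make the two components of this model concrete, we could simulate hypothetical yields from it. The parameter values below are made up purely for illustration:

```r
set.seed(123)
x <- seq(25, 250, by = 25)   # ten fertilizer levels
beta0 <- 50                  # hypothetical intercept
beta1 <- 0.8                 # hypothetical slope
sigma <- 20                  # hypothetical residual standard deviation
mu <- beta0 + beta1 * x      # deterministic component (mean of each distribution)
y <- rnorm(length(x), mean = mu, sd = sigma)  # stochastic component
```

Each simulated \(y_i\) is drawn from its own normal distribution (with mean \(\mu_i\)), yet all of the distributions share the same \(\sigma\).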
We intend to explore the relationship between Yield and Fertiliser using simple linear regression. Such an analysis assumes that:

- the observations are independent of one another
- the relationship between Yield and Fertilizer is linear
- the residuals are normally distributed
- the residuals are equally varied (homogeneity of variance)
ggplot(fert, aes(y = YIELD, x = FERTILIZER)) +
geom_point() +
geom_smooth()
ggplot(fert, aes(y = YIELD, x = FERTILIZER)) +
geom_point() +
geom_smooth(method = "lm")
ggplot(fert, aes(y = YIELD)) +
geom_boxplot(aes(x = 1))
Conclusions:
Let's start by performing Ordinary Least Squares (OLS) regression from first principles.
To do so, we express the response as a vector (\(Y\)) and the deterministic component (linear predictor) as a matrix (\(\boldsymbol{X}\)) comprising a column of 1's (to multiply the intercept parameter) and the predictor variable (to multiply the slope parameter). We then also have a vector of beta parameters (\(\beta_0\) and \(\beta_1\), representing the intercept and slope respectively).
\[ y_i = \beta_0\times 1 + \beta_1\times x_i \]
\[ \underbrace{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_i \end{bmatrix}}_{Y} = \underbrace{\begin{bmatrix} 1&x_1\\ 1&x_2\\ 1&x_3\\ ...\\ 1&x_i \end{bmatrix}}_{\boldsymbol{X}} \underbrace{\vphantom{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_i \end{bmatrix}}\begin{bmatrix} \beta_0\\ \beta_1 \end{bmatrix}}_{\boldsymbol{\beta}} \]
In OLS we estimate the parameters (\(\beta_0\) and \(\beta_1\)) by minimising the sum of the squared residuals (\(\boldsymbol{\varepsilon}\)).
\[ \begin{align} Y =& \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\\ \boldsymbol{\varepsilon} =& Y - \boldsymbol{X}\boldsymbol{\beta} \end{align} \]
We can then minimise the sum of squared residuals. This is essentially \(\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}\), where \(\boldsymbol{\varepsilon}^T\) is \(\boldsymbol{\varepsilon}\) transposed.
\[ \small \begin{bmatrix} \varepsilon_1&\varepsilon_2&\varepsilon_3&...&\varepsilon_n \end{bmatrix} \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ ...\\ \varepsilon_n\\ \end{bmatrix} = \begin{bmatrix} \varepsilon_1\times\varepsilon_1 + \varepsilon_2\times\varepsilon_2 + \varepsilon_3\times\varepsilon_3 + ... + \varepsilon_n\times\varepsilon_n \end{bmatrix} \]
\[ \begin{align} \boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon} =& (Y - \boldsymbol{X}\boldsymbol{\beta})^T(Y - \boldsymbol{X}\boldsymbol{\beta})\\ \boldsymbol{\beta} =& (\boldsymbol{X}^T\boldsymbol{X})^{-1} \boldsymbol{X}^T Y \end{align} \]
## Generate the model matrix
X <- model.matrix(~FERTILIZER, data = fert)
## Solve for beta
beta <- solve(t(X) %*% X) %*% t(X) %*% fert$YIELD
beta
## [,1]
## (Intercept) 51.9333333
## FERTILIZER 0.8113939
The above simple linear regression model can be fit using the lm() function, which uses Ordinary Least Squares (OLS). The full model formula would include a 1 (for the intercept) and the predictor variable (for the slope). In R, models include an intercept by default, so we can abbreviate the formula a little by omitting the 1.
fert.lm <- lm(YIELD~1+FERTILIZER, data = fert)
fert.lm <- lm(YIELD~FERTILIZER, data = fert)
The lm() function returns a list comprising the estimated coefficients, the residuals, the data used to fit the model and various other attributes associated with the model fitting procedure.
## Get the name of the attributes within the lm model
attributes(fert.lm)
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
## Explore all the contents of the fitted model
str(fert.lm)
## List of 12
## $ coefficients : Named num [1:2] 51.933 0.811
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "FERTILIZER"
## $ residuals : Named num [1:10] 11.78 -12.5 -22.79 20.93 -5.36 ...
## ..- attr(*, "names")= chr [1:10] "1" "2" "3" "4" ...
## $ effects : Named num [1:10] -517.03 184.25 -23.73 18.66 -8.96 ...
## ..- attr(*, "names")= chr [1:10] "(Intercept)" "FERTILIZER" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:10] 72.2 92.5 112.8 133.1 153.4 ...
## ..- attr(*, "names")= chr [1:10] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:10, 1:2] -3.162 0.316 0.316 0.316 0.316 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:10] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "FERTILIZER"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.32 1.27
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 8
## $ xlevels : Named list()
## $ call : language lm(formula = YIELD ~ FERTILIZER, data = fert)
## $ terms :Classes 'terms', 'formula' language YIELD ~ FERTILIZER
## .. ..- attr(*, "variables")= language list(YIELD, FERTILIZER)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "YIELD" "FERTILIZER"
## .. .. .. ..$ : chr "FERTILIZER"
## .. ..- attr(*, "term.labels")= chr "FERTILIZER"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(YIELD, FERTILIZER)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "YIELD" "FERTILIZER"
## $ model :'data.frame': 10 obs. of 2 variables:
## ..$ YIELD : num [1:10] 84 80 90 154 148 169 206 244 212 248
## ..$ FERTILIZER: num [1:10] 25 50 75 100 125 150 175 200 225 250
## ..- attr(*, "terms")=Classes 'terms', 'formula' language YIELD ~ FERTILIZER
## .. .. ..- attr(*, "variables")= language list(YIELD, FERTILIZER)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "YIELD" "FERTILIZER"
## .. .. .. .. ..$ : chr "FERTILIZER"
## .. .. ..- attr(*, "term.labels")= chr "FERTILIZER"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(YIELD, FERTILIZER)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "YIELD" "FERTILIZER"
## - attr(*, "class")= chr "lm"
## Return the data used to fit the model
fert.lm$model
## Return the estimated coefficients
fert.lm$coefficients
## (Intercept) FERTILIZER
## 51.9333333 0.8113939
Rather than directly access these attributes (which may have different names in different modelling functions), there are extraction functions that reach into the model output and extract the information in a convenient way.
## Extract the estimated coefficients
coef(fert.lm)
## (Intercept) FERTILIZER
## 51.9333333 0.8113939
## Extract the fitted (predicted) values
fitted(fert.lm)
## 1 2 3 4 5 6 7 8
## 72.21818 92.50303 112.78788 133.07273 153.35758 173.64242 193.92727 214.21212
## 9 10
## 234.49697 254.78182
## Extract the residuals
resid(fert.lm)
## 1 2 3 4 5 6 7
## 11.781818 -12.503030 -22.787879 20.927273 -5.357576 -4.642424 12.072727
## 8 9 10
## 29.787879 -22.496970 -6.781818
args(residuals.lm)
## function (object, type = c("working", "response", "deviance",
## "pearson", "partial"), ...)
## NULL
After fitting a model, it is important to explore a range of diagnostics to confirm that all the assumptions were satisfied.
When supplied with a fitted linear model, the autoplot() function generates a set of standard diagnostic regression plots:
autoplot(fert.lm, which = 1:6, ncol = 2, label.size = 3)
Conclusions:
Cook’s distance (cook.d) and leverage (hat) are displayed in tabular form.
influence.measures(fert.lm)
## Influence measures of
## lm(formula = YIELD ~ FERTILIZER, data = fert) :
##
## dfb.1_ dfb.FERT dffit cov.r cook.d hat inf
## 1 5.39e-01 -0.4561 0.5411 1.713 0.15503 0.345
## 2 -4.15e-01 0.3277 -0.4239 1.497 0.09527 0.248
## 3 -6.01e-01 0.4237 -0.6454 0.969 0.18608 0.176
## 4 3.80e-01 -0.2145 0.4634 1.022 0.10137 0.127
## 5 -5.77e-02 0.0163 -0.0949 1.424 0.00509 0.103
## 6 -2.50e-02 -0.0141 -0.0821 1.432 0.00382 0.103
## 7 -1.80e-17 0.1159 0.2503 1.329 0.03374 0.127
## 8 -2.19e-01 0.6184 0.9419 0.623 0.31796 0.176
## 9 3.29e-01 -0.6486 -0.8390 1.022 0.30844 0.248
## 10 1.51e-01 -0.2559 -0.3035 1.900 0.05137 0.345 *
The performance package provides a similar set of visual diagnostics:
performance::check_model(fert.lm)
performance::check_outliers(fert.lm)
## OK: No outliers detected.
performance::check_outliers(fert.lm) %>% plot
## These are probabilities of exceedance rather than actual Cook's D values
#https://easystats.github.io/performance/reference/check_outliers.html
Conclusions:
Although an exploration of model residuals can provide important insights into the goodness of fit and conformity of a model to the underlying assumptions, diagnosing issues is a bit of a fine art. This difficulty is exacerbated for some models (e.g. logistic regression models) where it is hard to separate nefarious patterns from the specific patterns expected of that model.
The DHARMa package generates standardized residuals via simulation and uses these as the basis of a range of tools to diagnose common modelling issues including outliers, heterogeneity, overdispersion and autocorrelation.
New observations simulated from the fitted model are used to calculate a cumulative density function (CDF) that describes the probability profile of each observation. The residual of an observation is then calculated as the value of the CDF that corresponds to the actual observed value.
This approach ensures that all residuals have the same interpretation irrespective of the model and distribution selected.
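The idea can be sketched by hand with the simulate() function. Note, this is only an illustration of the concept, not DHARMa's exact implementation, and it uses the fert.lm model fitted earlier:

```r
## Conceptual sketch of a simulation-based (scaled) residual
set.seed(1)
sims <- simulate(fert.lm, nsim = 250)  # 250 response vectors simulated from the model
## The scaled residual for observation i is the empirical CDF of the
## simulated values evaluated at the observed value
scaled_resid <- sapply(seq_len(nrow(fert)), function(i)
    ecdf(as.numeric(sims[i, ]))(fert$YIELD[i]))
scaled_resid  # each value lies in [0, 1]; approximately uniform if the model fits
```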
Exploring DHARMa residuals begins with running the simulateResiduals() function. If this is run with plot = TRUE, a pair of diagnostic plots, along with a range of diagnostic tests, will also be produced as a side effect.
fert.resid <- simulateResiduals(fert.lm, plot = TRUE)
## To run tests of KS (uniformity), dispersion and outliers
testResiduals(fert.resid)
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.224, p-value = 0.6209
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 0.97718, p-value = 0.952
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa bootstrapped outlier test
##
## data: simulationOutput
## outliers at both margin(s) = 0, observations = 10, p-value = 1
## alternative hypothesis: two.sided
## percent confidence interval:
## 0.0 0.1
## sample estimates:
## outlier frequency (expected: 0.011 )
## 0
## OR individually
testUniformity(fert.resid)
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.224, p-value = 0.6209
## alternative hypothesis: two-sided
testDispersion(fert.resid)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 0.97718, p-value = 0.952
## alternative hypothesis: two.sided
testOutliers(fert.resid)
##
## DHARMa bootstrapped outlier test
##
## data: fert.resid
## outliers at both margin(s) = 0, observations = 10, p-value = 1
## alternative hypothesis: two.sided
## percent confidence interval:
## 0.0 0.1
## sample estimates:
## outlier frequency (expected: 0.009 )
## 0
## Other useful tests
testZeroInflation(fert.resid)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = NaN, p-value = 1
## alternative hypothesis: two.sided
testQuantiles(fert.resid)
##
## Test for location of quantiles via qgam
##
## data: simulationOutput
## p-value = 0.5938
## alternative hypothesis: both
## The above fits quantile gams at 0.25, 0.5 and 0.75
## testSpatialAutocorrelation(fert.resid, x=, y=) # needs x and y coordinates
## testTemporalAutocorrelation(fert.resid, time=) # needs time
The broom package has an augment method that adds information generated from a fitted model to the modelled data.
augment(fert.lm)
## we could then pipe these to ggplot in order to look at residuals etc
augment(fert.lm) %>%
ggplot() +
geom_point(aes(y = .resid, x = .fitted))
The sjPlot package also has plotting methods associated with fitted models, some of which focus on model diagnostics.
plot_grid(plot_model(fert.lm, type = "diag"))
Prior to examining the estimated coefficients and any associated hypothesis tests, it is a good idea to explore the fitted trends. Not only does this give you a sense of the overall outcomes (and help you interpret the model summaries etc.), it also provides an opportunity to reflect on whether the model has yielded sensible patterns.
fert.lm %>% plot_model(type = "eff", show.data = TRUE)
## $FERTILIZER
plot(allEffects(fert.lm, residuals = TRUE))
fert.lm %>% ggpredict() %>% plot(add.data = TRUE)
## $FERTILIZER
fert.lm %>% ggemmeans(~FERTILIZER) %>% plot(add.data = TRUE)
summary(fert.lm)
##
## Call:
## lm(formula = YIELD ~ FERTILIZER, data = fert)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.79 -11.07 -5.00 12.00 29.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.93333 12.97904 4.001 0.00394 **
## FERTILIZER 0.81139 0.08367 9.697 1.07e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19 on 8 degrees of freedom
## Multiple R-squared: 0.9216, Adjusted R-squared: 0.9118
## F-statistic: 94.04 on 1 and 8 DF, p-value: 1.067e-05
Conclusions:
Had we centred the predictor (e.g. lm(YIELD~scale(FERTILIZER,scale=FALSE), data=fert)), then the y-intercept would have had a sensible interpretation: it would have been the expected Yield at the average Fertilizer level. Nevertheless, either way, the hypothesis test associated with the intercept is rarely of any value. Arguably, the confidence intervals associated with each of the estimates are more useful than the p-values. P-values only indicate the probability of obtaining an estimate at least as extreme as the one observed if there really was no effect. They do not provide a measure of the magnitude (or importance) of the effect. The slope is the point estimate of the magnitude of the effect, and the confidence intervals provide insights into the broader range of plausible effects. For example, in addition to describing the estimated effect size (slope), we could discuss the implications of a slope with a magnitude as low as the lower confidence limit or as high as the upper confidence limit.
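As a quick sketch of the centring idea, refitting with a centred predictor shows that the intercept then equals the mean observed Yield (an OLS line always passes through the point (\(\bar{x}\), \(\bar{y}\))):

```r
## Centre the predictor: the intercept becomes the expected Yield at the
## average Fertilizer level
fert.lm2 <- lm(YIELD ~ scale(FERTILIZER, scale = FALSE), data = fert)
coef(fert.lm2)[1]   # 163.5, identical to mean(fert$YIELD)
```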
confint(fert.lm)
## 2.5 % 97.5 %
## (Intercept) 22.0036246 81.863042
## FERTILIZER 0.6184496 1.004338
Conclusions:
Each modelling routine (and package) may store and present its results in a different structure and format. In an attempt to unify the output, the broom package includes a tidy() method that produces more consistent outputs (structure and names) across modelling routines.
tidy(fert.lm, conf.int=TRUE)
We can pipe this to kable if we want more formatted output.
fert.lm %>% tidy(conf.int = TRUE) %>% kable
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 51.9333333 | 12.9790352 | 4.001325 | 0.0039425 | 22.0036246 | 81.863042 |
| FERTILIZER | 0.8113939 | 0.0836704 | 9.697499 | 0.0000107 | 0.6184496 | 1.004338 |
For HTML output, the sjPlot package has a method for producing comparable and nicely formatted outputs across different modelling routines.
# warning this is only appropriate for html output
sjPlot::tab_model(fert.lm, show.se = TRUE, show.aic = TRUE)
|  | YIELD | |||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | CI | p |
| (Intercept) | 51.93 | 12.98 | 22.00 – 81.86 | 0.004 |
| FERTILIZER | 0.81 | 0.08 | 0.62 – 1.00 | <0.001 |
| Observations | 10 | |||
| R2 / R2 adjusted | 0.922 / 0.912 | |||
| AIC | 91.035 | |||
For simple models, prediction is essentially a matter of taking the model formula, complete with parameter (coefficient) estimates, and solving it for new values of the predictor. To explore this, we will use the fitted model to predict Yield for a Fertilizer concentration of 110.
## establish a data set that defines the new data to predict against
newdata = data.frame(FERTILIZER = 110)
## using the predict function
predict(fert.lm, newdata = newdata)
## 1
## 141.1867
## include confidence intervals
predict(fert.lm, newdata = newdata, interval = "confidence")
## fit lwr upr
## 1 141.1867 126.3506 156.0227
Conclusions:
## establish a data set that defines the new data to predict against
newdata = data.frame(FERTILIZER = 110)
## Establish an appropriate model matrix
Xmat = model.matrix(~FERTILIZER, data = newdata)
## Perform matrix multiplication
(pred <- coef(fert.lm) %*% t(Xmat))
## 1
## [1,] 141.1867
## Calculate the standard error
se <- sqrt(diag(Xmat %*% vcov(fert.lm) %*% t(Xmat)))
## Calculate the confidence intervals
as.numeric(pred) + outer(se, qt(df = df.residual(fert.lm), c(0.025, 0.975)))
## [,1] [,2]
## 1 126.3506 156.0227
The emmeans package has a set of routines for estimating marginal means from fitted models.
## using emmeans
newdata = list(FERTILIZER = 110)
emmeans(fert.lm, ~FERTILIZER, at = newdata)
## FERTILIZER emmean SE df lower.CL upper.CL
## 110 141 6.43 8 126 156
##
## Confidence level used: 0.95
Note, the above outputs are rounded only for the purpose of constructing the printed table on screen. If we were to capture the output of the emmeans() function, the values would not be rounded.
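For example, capturing the output and coercing it to a data frame exposes the unrounded values:

```r
fert.emm <- emmeans(fert.lm, ~FERTILIZER, at = list(FERTILIZER = 110))
as.data.frame(fert.emm)   # emmean = 141.1867, matching the predict() result above
```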
Having fitted the simple linear model, we can further interrogate the model to explore additional facets.
## testing a specific hypothesis
## Probability of getting our estimate if the slope was 1
multcomp::glht(fert.lm, linfct = c("FERTILIZER == 1")) %>% summary
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = YIELD ~ FERTILIZER, data = fert)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## FERTILIZER == 1 0.81139 0.08367 -2.254 0.0542 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
## Can't ask for the probability that the slope equals a value in a frequentist framework
## If we wanted to know the probability that the slope was greater than
## 0.9, the closest we could get is
multcomp::glht(fert.lm, linfct = c("FERTILIZER >= 0.9")) %>% summary
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = YIELD ~ FERTILIZER, data = fert)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(<t)
## FERTILIZER >= 0.9 0.81139 0.08367 -1.059 0.16
## (Adjusted p values reported -- single-step method)
## testing a specific hypothesis
## Probability of getting our estimate if the slope was 1
brms::hypothesis(fert.lm, "FERTILIZER = 1")
## Hypothesis Tests for class :
## Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
## 1 (FERTILIZER)-(1) = 0 -0.18 0.09 -0.33 0.01 NA
## Post.Prob Star
## 1 NA
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
## Can't ask for the probability that the slope equals a value in a frequentist framework
## If we wanted to know the probability that the slope was greater than
## 0.9, the closest we could get is
brms::hypothesis(fert.lm, "FERTILIZER > 0.9")
## Hypothesis Tests for class :
## Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
## 1 (FERTILIZER)-(0.9) > 0 -0.08 0.08 -0.2 0.07 0.18
## Post.Prob Star
## 1 0.15
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
We could explore the expected change in Yield associated with a specific increase in Fertilizer. For example, how much do we expect Yield to increase when increasing the Fertilizer concentration from 100 to 200?
newdata <- list(FERTILIZER = c(200, 100))
fert.lm %>% emmeans(pairwise~FERTILIZER, at = newdata)
## $emmeans
## FERTILIZER emmean SE df lower.CL upper.CL
## 200 214 7.97 8 196 233
## 100 133 6.78 8 117 149
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## 200 - 100 81.1 8.37 8 9.697 <.0001
## or with confidence intervals
fert.lm %>% emmeans(pairwise~FERTILIZER, at = newdata) %>% confint
## $emmeans
## FERTILIZER emmean SE df lower.CL upper.CL
## 200 214 7.97 8 196 233
## 100 133 6.78 8 117 149
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df lower.CL upper.CL
## 200 - 100 81.1 8.37 8 61.8 100
##
## Confidence level used: 0.95
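Because the model is linear, the same contrast can also be recovered directly as the slope multiplied by the difference in Fertilizer levels:

```r
## A 100 g.m-2 increase in Fertilizer corresponds to slope x 100
coef(fert.lm)["FERTILIZER"] * 100
## FERTILIZER 
##   81.13939
```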
Although there are numerous easy-to-use routines that will generate partial plots (see above), it is also useful to be able to produce the data behind the figures yourself. That way, you have more control over the type and style of the figures.
## Using emmeans
fert_grid <- with(fert, list(FERTILIZER = seq(min(FERTILIZER), max(FERTILIZER), len = 100)))
## OR
fert_grid <- fert %>% data_grid(FERTILIZER = seq_range(FERTILIZER, n = 100))
newdata <- fert.lm %>%
emmeans(~FERTILIZER, at = fert_grid) %>%
as.data.frame
newdata <- emmeans(fert.lm, ~FERTILIZER, at = fert_grid) %>% as.data.frame
newdata %>% head
ggplot(newdata, aes(y = emmean, x = FERTILIZER))+
geom_point(data = fert, aes(y = YIELD)) +
geom_ribbon(aes(ymin = lower.CL, ymax = upper.CL), fill = "blue", alpha = 0.3) +
geom_line() +
scale_y_continuous(expression(Grass~yield~(g.m^-2)))+
scale_x_continuous(expression(Fertilizer~concentration~(g.m^-2)))+
theme_classic()
## Manually, via the model matrix
newdata <- with(fert, data.frame(FERTILIZER = seq(min(FERTILIZER), max(FERTILIZER), len=100)))
Xmat <- model.matrix(~FERTILIZER, data = newdata)
coefs <- coef(fert.lm)
preds <- coefs %*% t(Xmat)
se <- sqrt(diag(Xmat %*% vcov(fert.lm) %*% t(Xmat)))
fert.df <- df.residual(fert.lm)
newdata <- newdata %>%
mutate(fit = as.vector(preds),
lower = fit - qt(0.975, df = fert.df)*se,
upper = fit + qt(0.975, df = fert.df)*se)
ggplot(newdata, aes(y = fit, x = FERTILIZER))+
geom_point(data = fert, aes(y = YIELD)) +
geom_ribbon(aes(ymin = lower, ymax = upper), fill = "blue", alpha =0.3) +
geom_line() +
scale_y_continuous(expression(Grass~yield~(g.m^-2)))+
scale_x_continuous(expression(Fertilizer~concentration~(g.m^-2)))+
theme_classic()